Abstract

Given a similarity graph between items, correlation clustering (CC) groups similar items together and 
dissimilar ones apart. One of the most popular CC algorithms is KwikCluster, an algorithm that serially
clusters neighborhoods of vertices, and obtains a 3-approximation ratio. Unfortunately, KwikCluster in 
practice requires a large number of clustering rounds, a potential bottleneck for large graphs. 

We present C4 and ClusterWild!, two algorithms for parallel correlation clustering that run in a 
polylogarithmic number of rounds and achieve nearly linear speedups, provably. C4 uses concurrency 
control to enforce serializability of a parallel clustering process, and guarantees a 3-approximation ratio. 
ClusterWild! is a coordination-free algorithm that abandons consistency for the benefit of better
scaling; this leads to a provably small loss in the 3-approximation ratio. 

We provide extensive experimental results for both algorithms, where we outperform the state of the 
art, both in terms of clustering accuracy and running time. We show that our algorithms can cluster 
billion-edge graphs in under 5 seconds on 32 cores, while achieving a 15x speedup. 


1 Introduction 

Clustering items according to some notion of similarity is a major primitive in machine learning. Correlation 
clustering serves as a basic means to achieve this goal: given a similarity measure between items, the goal 
is to group similar items together and dissimilar items apart. In contrast to other clustering approaches, 
the number of clusters is not determined a priori, and good solutions aim to balance the tension between 
grouping all items together versus isolating them. 

The simplest CC variant can be described on a complete signed graph. Our input is a graph G on n
vertices, with +1 weights on edges between similar items, and −1 edges between dissimilar ones. Our
goal is to generate a partition of vertices into disjoint sets that minimizes the number of disagreeing
edges: this equals the number of “+” edges cut by the clusters plus the number of “−” edges inside the
clusters. This metric is commonly called the number of disagreements. In Figure 1, we give a toy example
of a CC instance.
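
To make the objective concrete, the following is a minimal Python sketch of the disagreement count on a complete signed graph. It is an illustration only, not the authors' implementation; the function name `disagreements`, the 0-indexed vertices, and the convention that every pair not listed as positive is negative are assumptions made here.

```python
from itertools import combinations


def disagreements(n, positive_edges, cluster_of):
    """Count disagreeing pairs on a complete signed graph with vertices 0..n-1.

    positive_edges: iterable of (u, v) pairs joined by a "+" edge;
                    every other pair is treated as a "-" edge (assumption).
    cluster_of:     list mapping each vertex to its cluster label.
    """
    pos = {frozenset(e) for e in positive_edges}
    cost = 0
    for u, v in combinations(range(n), 2):
        same_cluster = cluster_of[u] == cluster_of[v]
        is_positive = frozenset((u, v)) in pos
        # A "+" edge across clusters or a "-" edge inside a cluster disagrees.
        if is_positive != same_cluster:
            cost += 1
    return cost
```

For example, placing every vertex in its own cluster costs exactly the number of “+” edges, while placing all vertices in one cluster costs the number of “−” edges.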

Entity deduplication is the archetypal motivating example for correlation clustering, with applications
in chat disentanglement, co-reference resolution, and spam detection. The input is a
set of entities (say, results of a keyword search), and a pairwise classifier that indicates, with some error,
similarities between entities. Two results of a keyword search might refer to the same item, but might look
different if they come from different sources. By building a similarity graph between entities and then
applying CC, the hope is to cluster duplicate entities in the same group; in the context of keyword search, this
implies a more meaningful and compact list of results. CC has been further applied to finding communities
in signed networks, classifying missing edges in opinion or trust networks, gene clustering [3], and
consensus clustering [3].

Figure 1: In the above graph (with two clusters, cluster 1 and cluster 2), solid edges denote similarity
and dashed edges dissimilarity. The number of disagreeing edges in the above clustering is 2; we color
these edges red. cost = (# “−” edges inside clusters) + (# “+” edges across clusters) = 2.

KwikCluster is the simplest CC algorithm that achieves a provable 3-approximation ratio, and works
in the following way: pick a vertex v at random (a cluster center), create a cluster for v and its positive
neighborhood N(v) (i.e., vertices connected to v with positive edges), peel these vertices and their
associated edges from the graph, and repeat until all vertices are clustered. Beyond its theoretical guarantees,
experimentally KwikCluster performs well when combined with local heuristics [3].

KwikCluster seems like an inherently sequential algorithm, and in most cases of interest it requires many
peeling rounds, because only a small number of vertices are clustered per round. This can be a
bottleneck for large graphs. Recently, there have been efforts to develop scalable variants of KwikCluster.
In one of them, a distributed peeling algorithm was presented in the context of MapReduce. Using an elegant
analysis, the authors establish a $(3+\epsilon)$-approximation in a polylogarithmic number of rounds. The algorithm
employs a simple step that rejects vertices that are executed in parallel but are “conflicting”; however, as we
see in our experiments, this seemingly minor coordination step hinders scale-ups in a parallel core setting.
In the tutorial of [5], a sketch of a distributed algorithm was presented. This algorithm achieves the same
approximation as KwikCluster, in a logarithmic number of rounds, in expectation. However, it performs
significant redundant work per iteration in its effort to detect in parallel which vertices should become
cluster centers.

Our contributions We present C4 and ClusterWild!, two parallel CC algorithms that in practice 
outperform the state of the art, both in terms of running time and clustering accuracy. C4 is a parallel 
version of KwikCluster that uses concurrency control to establish a 3-approximation ratio. ClusterWild! 
is a simple to implement, coordination-free algorithm that abandons consistency for the benefit of better 
scaling, while having a provably small loss in the 3-approximation ratio.

C4 achieves a 3-approximation ratio, in a poly-logarithmic number of rounds, by enforcing consistency
between concurrently running peeling threads. Consistency is enforced using concurrency control, a notion
extensively studied for database transactions, that was recently used to parallelize inherently sequential
machine learning algorithms.

ClusterWild! is a coordination-free parallel CC algorithm that waives consistency in favor of speed.
The cost that we pay is an arbitrarily small loss in ClusterWild!’s accuracy. We show that ClusterWild!
achieves a $(3+\epsilon) \cdot \mathrm{OPT} + O(\epsilon \cdot n \cdot \log^2 n)$ approximation, in a poly-logarithmic number of rounds, with provable
nearly linear speedups. Our main theoretical innovation for ClusterWild! is analyzing the
coordination-free algorithm as a serial variant of KwikCluster that runs on a “noisy” graph.

In our extensive experimental evaluation, we demonstrate that our algorithms gracefully scale up to 
graphs with billions of edges. In these large graphs, our algorithms output a valid clustering in less than 
5 seconds, on 32 threads, up to an order of magnitude faster than KwikCluster. We observe that, not
unexpectedly, ClusterWild! is faster than C4 and, quite surprisingly, abandoning coordination in this
parallel setting only amounts to a 1% relative loss in the clustering accuracy. Furthermore, we compare
against state of the art parallel CC algorithms, showing that we consistently outperform these algorithms in 
terms of both running time and clustering accuracy. 

Notation G denotes a graph with n vertices and m edges. G is complete and only has ±1 edges. We
denote by $d_v$ the positive degree of a vertex v, i.e., the number of vertices connected to v with positive edges.
$\Delta$ denotes the positive maximum degree of G, and N(v) denotes the positive neighborhood of v; moreover,
let $C_v = \{v, N(v)\}$. Two vertices u, v are neighbors in G if $u \in N(v)$ and vice versa. We denote by $\pi$ a
permutation of $\{1, \ldots, n\}$.

2 Two Parallel Algorithms for Correlation Clustering 

The formal definition of correlation clustering is given below. 




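Correlation Clustering. Given a graph G on n vertices, partition the vertices into an arbitrary number k
of disjoint subsets $C_1, \ldots, C_k$ such that the sum of negative edges within the subsets plus the sum of positive
edges across the subsets is minimized:

$$\mathrm{OPT} = \min_{1 \le k \le n} \; \min_{C_1, \ldots, C_k} \; \sum_{i=1}^{k} E^-(C_i, C_i) + \sum_{i=1}^{k} \sum_{j=i+1}^{k} E^+(C_i, C_j),$$

where the inner minimum is over partitions of the vertices into k disjoint subsets, and $E^+$ and $E^-$ are
the sets of positive and negative edges in G.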

KwikCluster is a remarkably simple algorithm that approximately solves the above combinatorial problem,
and operates as follows. A random vertex v is picked, a cluster $C_v$ is created with v and its positive
neighborhood, then the vertices in $C_v$ are peeled from the graph, and this process is repeated until all vertices
are clustered. KwikCluster can be equivalently executed, as noted by [2], if we substitute the random choice
of a vertex per peeling round with a random order $\pi$ preassigned to vertices (see Alg. 1). That is, select a
random permutation on vertices, then peel the vertex indexed by $\pi(1)$, and its neighbors. Remove from $\pi$
the vertices in $C_v$ and repeat this process. Having an order among vertices makes the discussion of parallel
algorithms more convenient.


2.1 C4: Parallel CC using Concurrency Control


Algorithm 1 KwikCluster with π

π = a random permutation of {1, ..., n}
while V ≠ ∅ do
    select the vertex v indexed by π(1)
    C_v = {v, N(v)}
    Remove clustered vertices from G and π
end while
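
The listing above can be sketched in a few lines of Python. This is an illustrative serial version only (the implementation evaluated in this paper is in Scala); the function name `kwik_cluster`, the adjacency-dictionary input, and the choice to label each cluster by its center vertex are assumptions made for the example.

```python
import random


def kwik_cluster(n, neighbors, pi):
    """Serial KwikCluster with a preassigned order.

    neighbors: dict mapping each vertex 0..n-1 to the set of its positive neighbors.
    pi:        list of all vertices in the order in which they are considered.
    Returns a list whose entry v is the center of the cluster containing v.
    """
    cluster_id = [None] * n
    for v in pi:
        if cluster_id[v] is not None:
            continue  # v was already peeled together with an earlier center
        cluster_id[v] = v  # v becomes a cluster center
        for u in neighbors[v]:
            if cluster_id[u] is None:
                cluster_id[u] = v  # unclustered positive neighbors join v
    return cluster_id


if __name__ == "__main__":
    path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
    order = random.sample(range(4), 4)  # a random permutation of the vertices
    print(order, kwik_cluster(4, path, order))
```

Skipping already-clustered vertices while scanning the order plays the role of removing clustered vertices from G and from the permutation.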


Suppose we now wish to run a parallel version of KwikCluster, say on two threads: one thread picks
vertex v indexed by $\pi(1)$ and the other thread picks u indexed by $\pi(2)$, concurrently. Can both vertices
be cluster centers? They can, if and only if they are not neighbors in G. If v and u are connected with
a positive edge, then the vertex with the smallest order wins. This is our concurrency rule no. 1. Now,
assume that v and u are not neighbors in G, and both v and u become cluster centers. Moreover, assume
that v and u have a common, unclustered neighbor, say w: should w be clustered with v, or u? We need to
follow what would happen with KwikCluster in Alg. 1: w will go with the vertex that has the smallest
permutation number, in this case v. This is concurrency rule no. 2. Following the above simple rules, we
develop C4, our serializable parallel CC algorithm. Since C4 constructs the same clusters as KwikCluster
(for a given ordering $\pi$), it inherits its 3-approximation by design. The above idea of identifying the
cluster centers in rounds was first used in [12] to obtain a parallel algorithm for maximal independent set (MIS).

C4, shown as Alg. 2, starts by assigning a random permutation $\pi$ to the vertices; it then samples an
active set A of $\epsilon \frac{n}{\Delta}$ unclustered vertices; this sample is taken from the prefix of $\pi$. After sampling A, each of
the P threads picks a vertex with the smallest order in A, then checks if that vertex can become a cluster
center. We first enforce concurrency rule no. 1: adjacent vertices cannot be cluster centers at the same
time. C4 enforces it by making each thread check the neighbors of the vertex, say v, that is picked from A.
A thread will check in attemptCluster whether its vertex v has any preceding neighbors (according to $\pi$)
that are cluster centers. If there are none, it will go ahead and label v as cluster center, and proceed with
creating a cluster. If a preceding neighbor of v is a cluster center, then v is labeled as not being a cluster
center. If a preceding neighbor of v, call it u, has not yet received a label (i.e., u is currently being processed
and is not yet labeled as cluster center or not), then the thread processing v will wait on u to receive a
label. The major technical detail is in showing that this wait time is bounded; we show that no more than
$O(\log n)$ threads can be in conflict at the same time, using a new subgraph sampling lemma [13]. Since C4
is serializable, it has to respect concurrency rule no. 2: if a vertex u is adjacent to two cluster centers, then
it gets assigned to the one with smaller permutation order. This is accomplished in createCluster. After
processing all vertices in A, all threads are synchronized in bulk, the clustered vertices are removed, a new
active set is sampled, and the same process is repeated until everything has been clustered. In the following
section, we present the theoretical guarantees for C4.


Algorithm 2 C4 & ClusterWild!

Input: G, ε
clusterID(1) = ... = clusterID(n) = ∞
π = a random permutation of {1, ..., n}
while V ≠ ∅ do
    Δ = maximum vertex degree in G(V)
    A = the first ε · n/Δ vertices in V[π]
    while A ≠ ∅ do in parallel
        v = first element in A
        A = A − {v}
        if C4 then
            attemptCluster(v)
        else if ClusterWild! then
            createCluster(v)
        end if
    end while
    Remove clustered vertices from V and π
end while
Output: {clusterID(1), ..., clusterID(n)}

createCluster(v):
    clusterID(v) = π(v)
    for u ∈ Γ(v) \ A do
        clusterID(u) = min(clusterID(u), π(v))
    end for

attemptCluster(v):
    if clusterID(v) = ∞ and isCenter(v) then
        createCluster(v)
    end if

isCenter(v):
    for u ∈ Γ(v) do // check friends (in order of π)
        if π(u) < π(v) then // if they precede you, wait
            wait until clusterID(u) ≠ ∞ // till clustered
            if isCenter(u) then
                return 0 // a friend is center, so you can’t be
            end if
        end if
    end for
    return 1 // no earlier friends are centers, so you are
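
The round structure of C4 can be illustrated with a single-threaded Python simulation. This is a sketch under stated assumptions, not the concurrent Scala implementation: processing the active set in the order of π stands in for "waiting" on earlier neighbors, active vertices that are not centers are assigned to their smallest-ranked adjacent center (the serial semantics), and the names `kwik_cluster` and `c4_bsp_simulated` are hypothetical. Cluster IDs are ranks in π, as in the listing.

```python
import math


def kwik_cluster(n, neighbors, pi):
    """Serial reference: entry v is the rank (position in pi) of v's cluster center."""
    rank = {v: i for i, v in enumerate(pi)}
    cluster_id = [math.inf] * n
    for v in pi:
        if cluster_id[v] == math.inf:
            cluster_id[v] = rank[v]
            for u in neighbors[v]:
                if cluster_id[u] == math.inf:
                    cluster_id[u] = rank[v]
    return cluster_id


def c4_bsp_simulated(n, neighbors, pi, eps=0.5):
    """Simulate C4's bulk-synchronous rounds serially; neighbors maps vertex -> set."""
    rank = {v: i for i, v in enumerate(pi)}
    cluster_id = [math.inf] * n
    remaining = list(pi)  # unclustered vertices, kept in the order of pi
    rounds = 0
    while remaining:
        rounds += 1
        alive = set(remaining)
        max_deg = max(len(neighbors[v] & alive) for v in remaining)
        batch = max(1, math.ceil(eps * len(remaining) / max(max_deg, 1)))
        active = remaining[:batch]  # prefix of pi among unclustered vertices
        centers = set()
        for v in active:
            # Rule 1: v is a center only if no earlier neighbor is a center.
            if not (neighbors[v] & centers):
                centers.add(v)
        for c in centers:
            cluster_id[c] = rank[c]
        for c in centers:
            for u in neighbors[c] & alive:
                # Rule 2: a vertex joins its adjacent center of smallest rank.
                # Centers are pairwise non-adjacent, so u is never a center.
                cluster_id[u] = min(cluster_id[u], rank[c])
        remaining = [v for v in remaining if cluster_id[v] == math.inf]
    return cluster_id, rounds
```

Under these assumptions the simulated output coincides with the serial reference for the same order, whatever the value of ε, which is the serializability property discussed above.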


2.2 ClusterWild!: Coordination-free Correlation Clustering 

ClusterWild! speeds up computation by ignoring the first concurrency rule. It uniformly samples
unclustered vertices, and builds clusters around all of them, without respecting the rule that cluster centers
cannot be neighbors in G. In ClusterWild!, threads bypass the attemptCluster routine; this eliminates
the “waiting” part of C4. ClusterWild! samples a set A of vertices from the prefix of $\pi$. Each thread
picks the first ordered vertex remaining in A, and using that vertex as a cluster center, it creates a cluster
around it. It peels away the clustered vertices and repeats the same process on the next remaining vertex
in A. At the end of processing all vertices in A, all threads are synchronized in bulk, the clustered vertices
are removed, a new active set is sampled, and the parallel clustering is repeated. A careful analysis along
the lines of [5] shows that the number of rounds (i.e., bulk synchronization steps) is only poly-logarithmic.

Quite unsurprisingly, ClusterWild! is faster than C4. Interestingly, abandoning consistency does not
incur much loss in the approximation ratio. We show how the error introduced in the accuracy of the solution
can be bounded. We characterize this error theoretically, and show that in practice it translates to only
a relative 1% loss in the objective. The main intuition of why ClusterWild! does not introduce too much
error is that the chance of two randomly selected vertices being neighbors is small, hence the concurrency
rules are infrequently broken.
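
The effect of dropping the first concurrency rule can be seen in a single-threaded Python simulation of the bulk-synchronous rounds. This is only a sketch: the name `cluster_wild_bsp_simulated`, the adjacency-dictionary input, and the rounding of the batch size are assumptions, and a real run would execute the inner loop concurrently.

```python
import math


def cluster_wild_bsp_simulated(n, neighbors, pi, eps=0.5):
    """Simulate ClusterWild! rounds serially; neighbors maps vertex -> set.

    Returns (cluster_id, rounds), where cluster_id[v] is the rank in pi of
    the center that v was assigned to.
    """
    rank = {v: i for i, v in enumerate(pi)}
    cluster_id = [math.inf] * n
    remaining = list(pi)  # unclustered vertices, kept in the order of pi
    rounds = 0
    while remaining:
        rounds += 1
        alive = set(remaining)
        max_deg = max(len(neighbors[v] & alive) for v in remaining)
        batch = max(1, math.ceil(eps * len(remaining) / max(max_deg, 1)))
        active = remaining[:batch]
        active_set = set(active)
        for v in active:
            cluster_id[v] = rank[v]  # every active vertex becomes a center
        for v in active:
            # Edges between active vertices are ignored (the set difference),
            # so adjacent active vertices may both end up as centers.
            for u in (neighbors[v] & alive) - active_set:
                cluster_id[u] = min(cluster_id[u], rank[v])
        remaining = [v for v in remaining if cluster_id[v] == math.inf]
    return cluster_id, rounds
```

On the path 0-1-2-3 with the identity order and ε = 1, vertices 0 and 1 are both active in the first round and both become centers, which serial KwikCluster would never allow; with a batch of size one per round the simulation reduces to KwikCluster.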


3 Theoretical Guarantees 

In this section, we bound the number of rounds required for each algorithm, and establish the theoretical
speedup one can obtain with P parallel threads. We then present our approximation guarantees. We
would like to remind the reader that, as in the relevant literature, we consider graphs that are complete, signed,
and unweighted. The omitted proofs can be found in the Appendix.
3.1 Number of rounds and running time

Our analysis follows those of [12] and [6]. The main idea is to track how fast the maximum degree decreases
in the remaining graph at the end of each round.

Lemma 1. C4 and ClusterWild! terminate after $O\left(\frac{1}{\epsilon} \log n \cdot \log \Delta\right)$ rounds w.h.p.

We now analyze the running time of both algorithms under a simplified BSP model. The main idea is
that the running time of each “super step” (i.e., round) is determined by the “straggling” thread (i.e.,
the one that gets assigned the most amount of work), plus the time needed for synchronization at the end
of each round.


Assumption 1. We assume that threads operate asynchronously within a round, and synchronize at the 
end of a round. A memory cell can be written/read concurrently by multiple threads. The time spent per 
round of the algorithm is proportional to the time of the slowest thread. Thread synchronization
at the end of each batch takes time $O(P)$, where P is the number of threads. The total computation cost is
proportional to the sum of the time spent for all rounds, plus the time spent during the bulk synchronization 
step. 


Under this simplified model, we show that both algorithms obtain nearly linear speedup, with
ClusterWild! being faster than C4, precisely due to lack of coordination. Our main tool for analyzing C4 is a
recent graph-theoretic result (Theorem 1 in [13]), which guarantees that if one samples an $O(n/\Delta)$ subset of
vertices in a graph, the connected components of the sampled subgraph have size at most $O(\log n)$. Combining
the above, in the appendix we show the following result.


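Theorem 2. The theoretical running time of C4, on P cores and $\epsilon = 1/2$, is upper bounded by

$$O\left(\left(\frac{m + n \log^2 n}{P} + P\right) \log n \cdot \log \Delta\right)$$

as long as the number of cores P is smaller than $\min_i \frac{n_i}{\Delta_i}$, where $\frac{n_i}{\Delta_i}$ is the size of the batch in the i-th
round of each algorithm. The running time of ClusterWild! on P cores is upper bounded by

$$O\left(\left(\frac{m + n}{P} + P\right) \frac{\log n \cdot \log \Delta}{\epsilon^2}\right)$$

for any constant $\epsilon > 0$.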


3.2 Approximation ratio 

We now proceed with establishing the approximation ratios of C4 and ClusterWild!. 

3.2.1 C4 is serializable 

It is straightforward that C4 obtains precisely the same approximation ratio as KwikCluster. One has to 
simply show that for any permutation $\pi$, KwikCluster and C4 will output the same clustering. This is
indeed true, as the two simple concurrency rules mentioned in the previous section are sufficient for C4 to 
be equivalent to KwikCluster. 

Theorem 3. C4 achieves a 3 approximation ratio, in expectation. 

3.2.2 ClusterWild! as a serial procedure on a noisy graph 

Analyzing ClusterWild! is a bit more involved. Our guarantees are based on the fact that ClusterWild! 
can be treated as if one was running a peeling algorithm on a “noisy” graph. Since adjacent active vertices 
can still become cluster centers in ClusterWild!, one can view the edges between them as “deleted,” by a 
somewhat unconventional adversary. We analyze this new, noisy graph and establish our theoretical result. 
Theorem 4. ClusterWild! achieves a $(3 + \epsilon) \cdot \mathrm{OPT} + O(\epsilon \cdot n \cdot \log^2 n)$ approximation, in expectation.

We provide a sketch of the proof, and delegate the details to the appendix. Since ClusterWild! ignores 
the edges among active vertices, we treat these edges as deleted. In our main result, we quantify the loss 
of clustering accuracy that is caused by ignoring these edges. Before we proceed, we define bad triangles, a 
combinatorial structure that is used to measure the clustering quality of a peeling algorithm. 

Definition 1. A bad triangle in G is a set of three vertices, such that two pairs are joined with a positive
edge, and one pair is joined with a negative edge. Let $T_b$ denote the set of bad triangles in G.
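
As a small illustration of Definition 1, the following Python sketch enumerates bad triangles by brute force. The function name `bad_triangles` and the adjacency-dictionary input are assumptions for the example, and the cubic enumeration is meant only for toy instances.

```python
from itertools import combinations


def bad_triangles(n, neighbors):
    """List all bad triangles of a complete signed graph on vertices 0..n-1.

    neighbors: dict mapping each vertex to the set of its positive neighbors;
               all other pairs are negative.
    A triple is bad when exactly two of its three pairs are positive.
    """
    bad = []
    for a, b, c in combinations(range(n), 3):
        positive_pairs = (
            (b in neighbors[a]) + (c in neighbors[a]) + (c in neighbors[b])
        )
        if positive_pairs == 2:
            bad.append((a, b, c))
    return bad
```

On the positive path 0-1-2-3 the bad triangles are (0, 1, 2) and (1, 2, 3), while a triangle whose three edges are all positive contains none.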

To quantify the cost of ClusterWild!, we make the following observation.

Lemma 5. The cost of any greedy algorithm that picks a vertex v (irrespective of the sampling order), creates
$C_v$, peels it away and repeats, is equal to the number of bad triangles adjacent to each cluster center v.

Lemma 6. Let $\hat{G}$ denote the random graph induced by deleting all edges between active vertices per round,
for a given run of ClusterWild!, and let $T_{\mathrm{new}}$ denote the number of additional bad triangles that $\hat{G}$ has
compared to G. Then, the expected cost of ClusterWild! can be upper bounded as

$$\mathbb{E}\left\{ \sum_{t \in T_b} \mathbf{1}_{\mathcal{P}_t} + T_{\mathrm{new}} \right\}$$

where $\mathcal{P}_t$ is the event that triangle t, with end points i, j, k, is bad, and at least one of its end points becomes
active, while t is still part of the original unclustered graph.

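We provide the proof for the above two lemmas in the Appendix. We continue with bounding the second
term $\mathbb{E}\{T_{\mathrm{new}}\}$ in the bound of Lemma 6 by considering the number of new bad triangles $T_{\mathrm{new},i}$ created at
each round i (in the following, $A_i$ denotes the set of active vertices at round i):

$$\mathbb{E}\{T_{\mathrm{new},i}\} \le \sum_{(u,v) \in E_i^+} \mathbb{P}(u, v \in A_i) \cdot |N_i(u) \cup N_i(v)| \le \sum_{(u,v) \in E_i^+} \left(\frac{\epsilon}{\Delta_i}\right)^2 \cdot 2 \cdot \Delta_i \le 2 \cdot \epsilon^2 \cdot \frac{E_i}{\Delta_i} \le 2 \cdot \epsilon^2 \cdot n$$

where $E_i^+$ is the set of remaining positive edges (of size $E_i$) and $N_i(v)$ the neighborhood of vertex v at round i, the second
inequality is due to the fact that the size of the neighborhoods is upper bounded by $\Delta_i$, the maximum
positive degree at round i, and the probability bound is true since we are sampling $\epsilon \frac{n_i}{\Delta_i}$ vertices without
replacement from a total of $n_i$, the number of unclustered vertices at round i; the final inequality is true
since $E_i \le n \cdot \Delta_i$. Using the result that ClusterWild! terminates after at most $O\left(\frac{1}{\epsilon} \log n \log \Delta\right)$ rounds,
we get that¹

$$\mathbb{E}\{T_{\mathrm{new}}\} \le O(\epsilon \cdot n \cdot \log^2 n).$$

¹ We skip the constants to simplify the presentation; however, they are all smaller than 10.

We are left to bound

$$\mathbb{E}\left\{ \sum_{t \in T_b} \mathbf{1}_{\mathcal{P}_t} \right\} = \sum_{t \in T_b} p_t.$$

To do that we use the following lemma.

Lemma 7. If $p_t$ satisfies

$$\forall e, \quad \sum_{t : e \subset t \in T_b} \frac{p_t}{\alpha} \le 1,$$

then,

$$\sum_{t \in T_b} p_t \le \alpha \cdot \mathrm{OPT}.$$

Proof. Let $B^*$ be one (of the possibly many) sets of edges that attribute a +1 in the cost of an optimal
algorithm. Then,

$$\mathrm{OPT} = \sum_{e \in B^*} 1 \ge \sum_{e \in B^*} \sum_{t : e \subset t \in T_b} \frac{p_t}{\alpha} = \sum_{t \in T_b} |B^* \cap t| \cdot \frac{p_t}{\alpha} \ge \sum_{t \in T_b} \frac{p_t}{\alpha}.$$

□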


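Now, as with [6], we will simply have to bound the expectation of the bad triangles adjacent to an edge
(u, v):

$$\mathbb{E}\left\{ \sum_{t : \{u,v\} \subset t \in T_b} \mathbf{1}_{\mathcal{P}_t} \right\}.$$

We denote the sum inside this expectation by $C_{u,v}$. Let $S_{u,v} = \bigcup_{\{u,v\} \subset t \in T_b} t$ be the union of the
sets of nodes of the bad triangles that contain both vertices u and v; below we write S for $S_{u,v}$ and A for the
active set. Observe that if some $w \in S \setminus \{u,v\}$ becomes active before u and v, then a cost of 1 (i.e., the cost of
the bad triangle {u, v, w}) is incurred. On the other hand, if either u or v, or both, are selected as pivots in
some round, then $C_{u,v}$ can be as high as $|S| - 2$, i.e., at most equal to all bad triangles containing the edge
{u, v}. Let $A_{u,v}$ = {u or v are activated before any other vertices in $S_{u,v}$}. Then,

$$\begin{aligned}
\mathbb{E}[C_{u,v}] &= \mathbb{E}[C_{u,v} \mid A_{u,v}] \cdot \mathbb{P}(A_{u,v}) + \mathbb{E}[C_{u,v} \mid A_{u,v}^C] \cdot \mathbb{P}(A_{u,v}^C) \\
&\le 1 + (|S| - 2) \cdot \mathbb{P}(\{u,v\} \cap A \ne \emptyset \mid S \cap A \ne \emptyset) \\
&\le 1 + 2|S| \cdot \mathbb{P}(v \in A \mid S \cap A \ne \emptyset)
\end{aligned}$$

where the last inequality is obtained by a union bound over u and v. We now bound the following probability:

$$\mathbb{P}\{v \in A \mid S \cap A \ne \emptyset\} = \frac{\mathbb{P}\{v \in A\} \cdot \mathbb{P}\{S \cap A \ne \emptyset \mid v \in A\}}{\mathbb{P}\{S \cap A \ne \emptyset\}} = \frac{\mathbb{P}\{v \in A\}}{\mathbb{P}\{S \cap A \ne \emptyset\}} = \frac{\mathbb{P}\{v \in A\}}{1 - \mathbb{P}\{S \cap A = \emptyset\}}.$$

Observe that $\mathbb{P}\{v \in A\} \le \frac{\epsilon}{\Delta}$, hence we need to upper bound $\mathbb{P}\{S \cap A = \emptyset\}$. The probability, per round,
that no positive neighbors in S become activated is upper bounded by

$$\frac{\binom{n - |S|}{P}}{\binom{n}{P}} = \prod_{t=1}^{|S|} \left(1 - \frac{P}{n - |S| + t}\right) \le \left(1 - \frac{P}{n}\right)^{|S|} = \left[\left(1 - \frac{P}{n}\right)^{n/P}\right]^{|S| P / n} \le \left(\frac{1}{e}\right)^{|S| P / n},$$

where P here denotes the number of vertices activated in the round, so that $P/n = \epsilon/\Delta$. Hence, we obtain
the following bound

$$|S| \cdot \mathbb{P}\{v \in A \mid S \cap A \ne \emptyset\} \le \frac{\epsilon \cdot |S| / \Delta}{1 - e^{-\epsilon \cdot |S| / \Delta}}.$$

We now know that $|S| \le 2 \cdot \Delta + 2$ and also $\epsilon \le 1$. Then,

$$0 \le \epsilon \cdot \frac{|S|}{\Delta} \le \epsilon \cdot \left(2 + \frac{2}{\Delta}\right) \le 4 \epsilon.$$

Hence, since $x / (1 - e^{-x})$ is increasing in x, we have

$$\mathbb{E}(C_{u,v}) \le 1 + 2 \cdot \frac{4 \epsilon}{1 - \exp\{-4 \epsilon\}}.$$

The overall expectation is then bounded by

$$\mathbb{E}\left\{ \sum_{t \in T_b} \mathbf{1}_{\mathcal{P}_t} + T_{\mathrm{new}} \right\} \le \left(1 + 2 \cdot \frac{4 \cdot \epsilon}{1 - e^{-4 \cdot \epsilon}}\right) \cdot \mathrm{OPT} + O(\epsilon \cdot n \cdot \log^2 n) \le (3 + \epsilon) \cdot \mathrm{OPT} + O(\epsilon \cdot n \cdot \log^2 n),$$

which establishes our approximation ratio for ClusterWild!.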
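
To get a feel for the multiplicative factor $1 + 2 \cdot \frac{4\epsilon}{1 - e^{-4\epsilon}}$ appearing in the bound above, it can be evaluated numerically. The Python sketch below is only an illustration; the function name `wild_factor` is an assumption, and `math.expm1` is used to keep the denominator accurate for small ε.

```python
import math


def wild_factor(eps):
    """Evaluate 1 + 2 * (4*eps) / (1 - exp(-4*eps)) for eps > 0."""
    x = 4.0 * eps
    # -expm1(-x) equals 1 - exp(-x) and avoids cancellation when x is tiny.
    return 1.0 + 2.0 * x / (-math.expm1(-x))


if __name__ == "__main__":
    for eps in (0.001, 0.01, 0.1, 0.5):
        print(eps, wild_factor(eps))
```

The factor tends to 3 as ε tends to 0, recovering the KwikCluster ratio, and grows with ε.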


3.3 BSP Algorithms as a Proxy for Asynchronous Algorithms 
We would like to note that the analysis under the BSP model can be a useful proxy for the performance
of completely asynchronous variants of our algorithms. Specifically, see Alg. 3, where we remove the
synchronization barriers.

The only difference between the asynchronous execution in Alg. 3, compared to Alg. 2, is the complete
lack of bulk synchronization at the end of the processing of each active set A. Although the analysis of the
BSP variants of the algorithms is tractable, unfortunately analyzing precisely the speedup of the
asynchronous C4 and the approximation guarantees for the asynchronous ClusterWild! is challenging.
However, in our experimental section we test the completely asynchronous algorithms against the
BSP algorithms of the previous section, and observe that they perform quite similarly both in terms of
accuracy of clustering, and running times.


Algorithm 3 C4 & ClusterWild! (asynchronous execution)

Input: G
clusterID(1) = ... = clusterID(n) = ∞
π = a random permutation of {1, ..., n}
while V ≠ ∅ do
    v = first element in V
    V = V − {v}
    if C4 then // concurrency control
        attemptCluster(v)
    else if ClusterWild! then // coordination free
        createCluster(v)
    end if
    Remove clustered vertices from V and π
end while
Output: {clusterID(1), ..., clusterID(n)}


4 Related Work 

Correlation clustering was formally introduced by Bansal et al. In the general case, minimizing
disagreements is NP-hard and hard to approximate within an arbitrarily small constant (APX-hard).
There are two variations of the problem: i) CC on complete graphs where all edges are present and all
weights are ±1, and ii) CC on general graphs with arbitrary edge weights. Both problems are hard; however,
the general graph setup seems fundamentally harder. The best known approximation ratio for the latter is
$O(\log n)$, and a reduction to the minimum multicut problem indicates that any improvement to that requires
fundamental breakthroughs in theoretical algorithms.

In the case of complete unweighted graphs, a long series of results establishes a 2.5 approximation via a
rounded linear program (LP) [10]. A recent result establishes a 2.06 approximation using an elegant rounding
to the same LP relaxation. Avoiding the expensive LP, and just using the rounding procedure
as a basis for a greedy algorithm, yields KwikCluster: a 3 approximation for CC on complete unweighted
graphs.

Variations of the cost metric for CC change the algorithmic landscape: maximizing agreements (the dual
measure of disagreements) [14], [18], [19], or maximizing the difference between the number of agreements
and disagreements [20], [21], come with different hardness and approximation results. There are also several
variants: chromatic CC [55], overlapping CC [53], or CC with small number of clusters and added constraints
that are suitable for biology applications [5T].

The way C4 finds the cluster centers can be seen as a variation of the MIS algorithm of [12]; the main
difference is that in our case, we “passively” detect the MIS, by locking on memory variables, and by waiting
on preceding ordered threads. This means that a vertex only “pushes” its cluster ID and status (cluster
center/clustered/unclustered) to its neighbors, versus “pulling” (or asking) for its neighbors’ cluster status.
This saves a substantial amount of computational effort. A sketch of the idea of using parallel MIS algorithms
for CC was presented in [3], where the authors suggest using Luby’s algorithm for finding an MIS, and then
using the MIS vertices as cluster centers. However, a closer look at this approach reveals that
fundamentally more work needs to be done to cluster the vertices.


5 Experiments 

Our parallel algorithms were all implemented in Scala; we defer a full discussion of the implementation
details to the Appendix. We ran all our experiments on Amazon EC2’s r3.8xlarge (32 vCPUs, 244 GB memory)
instances, using 1-32 threads. The real graphs listed in Table 1 were each tested with 100 different random $\pi$
orderings. We measured the runtimes, speedups (ratio of runtime on 1 thread to runtime on p threads), and
objective values obtained by our parallel algorithms. For comparison, we also implemented the algorithm
presented in [5], which we denote as CDK for short.² Values of $\epsilon$ = 0.1, 0.5, 0.9 were used for C4 BSP,
ClusterWild! BSP and CDK. In the interest of space, we present only representative plots of our results;
full results are given in our appendix.

Graph           # vertices      # edges          Description
DBLP-2011       986,324         6,707,236        2011 DBLP co-authorship network [25], [26], [27].
ENWiki-2013     4,206,785       101,355,853      2013 link graph of English part of Wikipedia [25], [26], [27].
UK-2005         39,459,925      921,345,078      2005 crawl of the .uk domain [25], [26], [27].
IT-2004         41,291,594      1,135,718,909    2004 crawl of the .it domain [25], [26], [27].
WebBase-2001    118,142,155     1,019,903,190    2001 crawl by WebBase crawler [25], [26], [27].

Table 1: Graphs used in the evaluation of our parallel algorithms.

5.1 Runtimes 

C4 and ClusterWild! are initially slower than serial, due to the overheads required for atomic operations 
in the parallel setting. However, all our parallel algorithms outperform serial KwikCluster with 3-4 threads. 
As more threads are added, the asynchronous variants become faster than their BSP counterparts as there are
no synchronization barriers. The difference between BSP and asynchronous variants is greater for smaller $\epsilon$.
ClusterWild! is also always faster than C4 since there are no coordination overheads. 

5.2 Speedups 

The asynchronous algorithms are able to achieve a speedup of 13-15x on 32 threads. The BSP algorithms 
have a poorer speedup ratio, but nevertheless achieve 10x speedup with $\epsilon = 0.9$.

5.3 Synchronization rounds 

The main overhead of the BSP algorithms lies in the need for synchronization rounds. As $\epsilon$ increases, the
amount of synchronization decreases, and with $\epsilon = 0.9$, our algorithms have fewer than 1000 synchronization
rounds, which is small considering the size of the graphs and our multicore setting.

5.4 Blocked vertices 

Additionally, C4 incurs an overhead in the number of vertices that are blocked waiting for earlier vertices 
to complete. We note that this overhead is extremely small in practice—on all graphs, less than 0.2% of 
vertices are blocked. On the larger and sparser graphs, this drops to less than 0.02% (i.e., 1 in 5000) of 
vertices. 

5.5 Objective value 

By design, the C4 algorithms also return the same output (and thus objective value) as serial KwikCluster. 
We find that ClusterWild! BSP is at most 1% worse than serial across all graphs and values of $\epsilon$. The
behavior of asynchronous ClusterWild! worsens as threads are added, reaching 15% worse than serial for 
one of the graphs. Finally, on the smaller graphs we were able to test CDK on, we find that CDK returns a 
worse median objective value than both ClusterWild! variants. 

² CDK was only tested on the smaller graphs of DBLP-2011 and ENWiki-2013, because CDK was prohibitively slow, often
2-3 orders of magnitude slower than C4, ClusterWild!, and even serial KwikCluster.
(b) Mean runtimes, IT-2004, $\epsilon = 0.5$

Figure 2: In the above figures, ‘CW’ is short for ClusterWild!, ‘BSP’ is short for the bulk-synchronous
variants of the parallel algorithms, and ‘As’ is short for the asynchronous variants.

6 Conclusions and Future Directions


We presented two parallel algorithms for correlation clustering that admit provable nearly linear speedups 
and approximation ratios. Our algorithms can cluster billion-edge graphs in under 5 seconds on 32 cores, 
while achieving a 15x speedup. The two approaches complement each other: when C4 is fast relative to 
ClusterWild!, we may prefer it for its guarantees of accuracy; and when ClusterWild! is accurate 
relative to C4, we may prefer it for its speed.

Both C4 and ClusterWild! are well-suited for a distributed setup since they run for at most a
polylogarithmic number of rounds. In the future, we intend to implement our algorithms in a distributed environment,
where synchronization and communication often account for the highest cost. 
A Proofs of Theoretical Guarantees


A.1 Number of rounds for C4 and ClusterWild!

Lemma 1. C4 and ClusterWild! terminate after $O\left(\frac{1}{\epsilon} \log n \cdot \log \Delta\right)$ rounds w.h.p.

Proof. We split our proof in two parts. 

For ClusterWild!, we wish to upper bound the probability 


qt =V <v not clustered by round i + t 


deg,+j(u) > 


A J A 


Observe that the above event happens either if no neighbors of v become activated by round * + t, or if w 
itself does not become activated. Hence, qt can be upper bounded by the probability that no neighbors of v 
become activated by round i + t. 

In the following, let di+j denote the degree of vertex v at roudn i + j] for simplicity we drop the round 
indices on n and P. The probability, per round, that no neighbors of v become activated is equal tc0 


(" ^ {n-Py. ^ {n-dj+jy 

(”) {n-P-di+jy nl 

_ ntd”!* ~ di+i + t — P) ^ -pr n — di+i + t — P 
Ut\'in-d^+i+t) n-d,+,+t 


nh 


p 


n — di+i + t 


< 11 -^ 

n 


/ X AU2 

■/ \ A,/£- 

<(1- — ) 

(1- — ) 

- V aJ 

[v aJ J 

due to the fact that 


(l-x)iA 

< e~^ for all x 


< e 


-e/2 


Therefore, the probability of vertex v failing to be clustered after t rounds is at most qt < Hence, 

we have that for any round i, the probability that any vertex has degree more than Ai/2 after t rounds is 
at most n ■ due to a simple union bound. If we want that that probability to be smaller than S, then 

n ■ < S ^ \nn — t ■ e/2 < ln(i5) ^ t > - ■ \n{n/5) 


Hence, with probability I —(5, after | dn(n/5) rounds either all nodes of degree greater than A/2 are clustered, 
or the maximum degree is decreased by half. Applying this argument log A times yields the result, as the 
maximum degree of the remaining graph becomes I. 
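
As a quick numerical sanity check of the per-round bound above, the following Python sketch evaluates the hypergeometric probability exactly and compares it with $e^{-\epsilon/2}$; it also computes the smallest integer t satisfying $t > \frac{2}{\epsilon}\ln(n/\delta)$. The function names and the parameter values are assumptions made for illustration only and are not taken from the paper's implementation or experiments.

```python
import math
from fractions import Fraction


def no_neighbor_activated_prob(n: int, d: int, batch: int) -> Fraction:
    """Exact probability that none of d fixed vertices falls in a batch of
    `batch` vertices sampled without replacement from n (hypergeometric)."""
    return Fraction(math.comb(n - d, batch), math.comb(n, batch))


def rounds_to_halve(n: int, eps: float, delta: float) -> int:
    """Smallest integer t with t > (2 / eps) * ln(n / delta)."""
    return math.floor((2.0 / eps) * math.log(n / delta)) + 1


# Illustrative values only (assumptions, not taken from the experiments).
n, max_deg, eps, delta = 10_000, 100, 0.5, 0.01
batch = int(eps * n / max_deg)   # P = eps * n / Delta active vertices
d = max_deg // 2                 # a vertex whose degree is still Delta / 2

per_round = no_neighbor_activated_prob(n, d, batch)
t = rounds_to_halve(n, eps, delta)
print(float(per_round), math.exp(-eps / 2), t)
```

With these values the exact per-round probability stays below $e^{-\epsilon/2}$, and the union bound $n \cdot e^{-t\epsilon/2}$ drops below $\delta$ at the returned t.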

For C4, the proof follows simply from the analogous proof of [12]. Consider any round of the algorithm, 
and break it into k steps (one step for each vertex in $\mathcal{A}$ that becomes a cluster center). Let v be a 
vertex that has degree at least $\Delta/2$, and is not active. During step 1 of round 1, the probability that v 
is not adjacent to $\pi(1)$ is at most $1 - \frac{\Delta}{2n}$. If v is not selected at step 1, then during step 2 of 
round 1, the probability that v is not adjacent to the next cluster center is again at most $1 - \frac{\Delta}{2n}$. 
After processing all vertices in $\mathcal{A}$ during the first round, either v was clustered, or its degree became 
strictly less than $\Delta/2$, or neither of the previous happened, which has probability at most 
$\left(1 - \frac{\Delta}{2n}\right)^{\epsilon n/\Delta} \le e^{-\epsilon/2}$. It is easy to see that after 
$O\left(\frac{1}{\epsilon} \log n\right)$ rounds vertex v will have either been clustered or its degree would be smaller 
than $\Delta/2$. Union bounding for n vertices and all rounds, we get that the max degree of the remaining graph 
gets halved after $O\left(\frac{1}{\epsilon} \log n\right)$ rounds, hence the total number of rounds needed is at most 
$O\left(\frac{1}{\epsilon} \log n \log \Delta\right)$, with high probability. □ 

* This follows from a simple calculation on the pmf of the hypergeometric distribution. 

A.2 Running times 

In this section, we prove the running time theorem for our algorithms. We first present the following recent 
graph-theoretic result. 


Theorem A.1 (Theorem 1 in [13]). Let G be an undirected graph on n vertices, with maximum vertex degree 
$\Delta$. Let us sample each vertex independently with probability $p = \frac{1-\epsilon}{\Delta}$ and define as G' the 
induced subgraph on the activated vertices. Then, the largest connected component of the resulting graph G' 
has size at most $O\left(\frac{1}{\epsilon^2} \log n\right)$ with high probability. 


To apply Theorem A.1 we first need to convert it into a result for sampling without replacement (instead 
of i.i.d. sampling). 


Lemma A.2. Let us define two sequences of binary random variables $\{X_i\}_{i=1}^n$ and $\{Y_i\}_{i=1}^n$. The first 
sequence comprises n i.i.d. Bernoulli random variables with probability p, and in the second sequence a 
uniformly random subset of B random variables is set to 1 without replacement, where B is the integer that 
satisfies 

$$(n + 1) \cdot p - 1 \le B < (n + 1) \cdot p.$$

Let us now define $p_X = \mathbb{P}\left(f(X_1, \ldots, X_n) > C\right)$ for some f (in our case this will be the size of 
the largest connected component of a subgraph defined on the sampled vertices) and some number C, and 
similarly define $p_Y$. Let us further assume that we have an upper bound on the above probability, 
$p_X \le \delta$. Then, $p_Y \le n \cdot \delta$. 


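Proof. By expanding $p_X$ using the law of total probability we have 

$$p_X = \sum_{b=0}^{n} \mathbb{P}\left(f(X_1, \ldots, X_n) > C \,\middle|\, \sum_{i=1}^{n} X_i = b\right) \cdot \mathbb{P}\left(\sum_{i=1}^{n} X_i = b\right) = \sum_{b=0}^{n} q_b \cdot \mathbb{P}\left(\sum_{i=1}^{n} X_i = b\right), \qquad (1)$$

where $q_b$ is the probability that $f(X_1, \ldots, X_n) > C$ given that a uniformly random subset of b variables 
was set to 1. Moreover, we have 

$$p_Y = \sum_{b=0}^{n} \mathbb{P}\left(f(Y_1, \ldots, Y_n) > C \,\middle|\, \sum_{i=1}^{n} Y_i = b\right) \cdot \mathbb{P}\left(\sum_{i=1}^{n} Y_i = b\right) \overset{(i)}{=} \sum_{b=0}^{n} q_b \cdot \mathbb{P}\left(\sum_{i=1}^{n} Y_i = b\right) \overset{(ii)}{=} q_B \cdot 1, \qquad (2)$$

where (i) comes from the fact that $\mathbb{P}\left(f(Y_1, \ldots, Y_n) > C \mid \sum_{i=1}^{n} Y_i = b\right)$ is the same as the 
probability that $f(X_1, \ldots, X_n) > C$ given that a uniformly random subset of b variables were set to 1, 
and (ii) comes from the fact that since we sample without replacement in Y, we have that 
$\sum_{i=1}^{n} Y_i = B$ always. 

If we just keep the b = B term in the expansion of $p_X$ we get 

$$p_X = \sum_{b=0}^{n} q_b \cdot \mathbb{P}\left(\sum_{i=1}^{n} X_i = b\right) \ge q_B \cdot \mathbb{P}\left(\sum_{i=1}^{n} X_i = B\right) = p_Y \cdot \mathbb{P}\left(\sum_{i=1}^{n} X_i = B\right), \qquad (3)$$

since all terms in the sum are non-negative numbers. Moreover, since the $X_i$'s are Bernoulli random 
variables, $\sum_{i=1}^{n} X_i$ is Binomially distributed with parameters n and p. We know that the maximum of the 
Binomial pmf with parameters n and p occurs at $\mathbb{P}\left(\sum_{i=1}^{n} X_i = B\right)$, where B is the integer that 
satisfies $(n + 1) \cdot p - 1 \le B < (n + 1) \cdot p$. Furthermore, we know that the maximum value of the Binomial 
pmf cannot be less than $\frac{1}{n}$, that is 

$$\mathbb{P}\left(\sum_{i=1}^{n} X_i = B\right) \ge \frac{1}{n}. \qquad (4)$$

If we combine (3) and (4) we get $p_X \ge p_Y / n$, and therefore $p_Y \le n \cdot p_X \le n \cdot \delta$. 


□ 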
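
The role of B in the proof can be checked on a small instance. The sketch below, written under assumed parameter values, computes the Binomial pmf exactly with rational arithmetic, locates the integer B with $(n+1)p - 1 \le B < (n+1)p$, and compares the pmf at B with the maximum of the pmf and with 1/n. The helper names `binomial_pmf` and `mode_index` are hypothetical.

```python
import math
from fractions import Fraction


def binomial_pmf(n: int, p: Fraction) -> list[Fraction]:
    """Exact pmf of Binomial(n, p) as a list indexed by b = 0..n."""
    return [math.comb(n, b) * p**b * (1 - p) ** (n - b) for b in range(n + 1)]


def mode_index(n: int, p: Fraction) -> int:
    """The unique integer B with (n + 1) p - 1 <= B < (n + 1) p."""
    return math.ceil((n + 1) * p) - 1


# Illustrative parameters (assumed for this example only).
n, p = 200, Fraction(1, 20)
pmf = binomial_pmf(n, p)
B = mode_index(n, p)

print(B, float(pmf[B]), 1 / n)
```

For this choice of n and p, the pmf attains its maximum at B and that maximum is well above 1/n, which is the only property of the Binomial distribution that the argument relies on.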


Corollary A.3. Let G be an undirected graph on n vertices, with maximum vertex degree $\Delta$. Let us sample 
$\epsilon \cdot \frac{n}{\Delta}$ vertices without replacement, and define as G' the induced subgraph on the activated 
vertices. Then, the largest connected component of the resulting graph G' has size at most 

$$O\left(\frac{\log n}{(1-\epsilon)^2}\right)$$

with high probability. 

We use this in the proof of our theorem that follows. 
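
To get a feel for the corollary, one can sample $\epsilon n / \Delta$ vertices without replacement from a bounded-degree graph and measure the largest connected component of the induced subgraph. The following sketch does this on a cycle (maximum degree 2), which is an arbitrary test graph chosen for illustration; the function names, the seed, and the parameter values are likewise assumptions.

```python
import random
from collections import deque


def largest_component(adj: dict[int, list[int]], sampled: set[int]) -> int:
    """Size of the largest connected component of the subgraph induced by `sampled`."""
    seen: set[int] = set()
    best = 0
    for start in sampled:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:                      # breadth-first search inside the sample
            u = queue.popleft()
            size += 1
            for w in adj[u]:
                if w in sampled and w not in seen:
                    seen.add(w)
                    queue.append(w)
        best = max(best, size)
    return best


def cycle_graph(n: int) -> dict[int, list[int]]:
    """Cycle on n vertices, so the maximum degree is 2."""
    return {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}


n, max_deg, eps = 10_000, 2, 0.5
adj = cycle_graph(n)
rng = random.Random(0)
batch = rng.sample(range(n), int(eps * n / max_deg))  # sampling without replacement
print(len(batch), largest_component(adj, set(batch)))
```

Repeating the experiment with larger values of `eps` should show the components growing, in line with the $(1-\epsilon)^{-2}$ dependence.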

Theorem 2. The theoretical running time of C4, on P cores and $\epsilon = 1/2$, is upper bounded by 

$$O\left(\left(\frac{m + n \log^2 n}{P} + P\right) \log n \cdot \log \Delta\right)$$

as long as the number of cores P is smaller than $\min_i \frac{n_i}{\Delta_i}$, where $\frac{n_i}{\Delta_i}$ is the size of the 
batch in the i-th round of each algorithm. The running time of ClusterWild! on P cores is upper bounded by 

$$O\left(\left(\frac{m + n}{P} + P\right) \log n \cdot \log \Delta\right)$$

for any constant $\epsilon > 0$. 

Proof. We start with analyzing C4, as the running time of ClusterWild! follows from a similar, and 
simpler, analysis. Observe that we operate on a Bulk Synchronous Parallel model: we sample a batch of 
vertices, P cores asynchronously process the vertices in the batch, and once the batch is empty there is a 
bulk synchronization step. The computational effort spent by C4 can be split in three parts: i) computing 
the maximum degree, ii) creating the clusters, per batch, iii) synchronizing at the end of each batch. 

Computing $\Delta$ and synchronizing cost. Computing $\Delta_i$ at the beginning of each batch can be implemented 
in time $\frac{n_i}{P} + \log P$, where each thread picks $n_i/P$ vertices, computes their degrees locally, and 
inserts them into a sorted data structure (e.g., a B-tree that admits parallel operations), after which we get 
the largest item in logarithmic time. Moreover, the third part of the computation, i.e., synchronization among 
cores, can be done in O(P). A slightly more involved argument is needed for establishing the running time of 
the second part, where the algorithms create the clusters. 


Clustering cost. For a single vertex v sampled by a thread, the time required by the thread to process 
that vertex is the sum of the time needed to 1) wait inside attemptCluster for preceding neighbors (by 
the order of $\pi$), 2) “send” its $\pi(v)$ to its neighbors, if v is a cluster center, and 3) if v is a cluster 
center, attempt to update clusterID(u) for each of its neighbors u; however, this thread potentially competes 
with other threads that are attempting to write in clusterID(u) at the same time. 

Using Corollary A.3 we can show that no more than O(log n) threads compete with each other at the 
same time, with high probability. Observe that in our sampling scheme of batches of vertices, we are taking 
the first $B_i = \frac{\epsilon}{\Delta_i} \cdot n_i$ elements of a random permutation $\pi$. This is equivalent to sampling 
$B_i$ vertices without replacement from the graph $G_i$ of the current round. The result in Corollary A.3 asserts 
that the largest connected component in the sampled subgraph is at most O(log n), with high probability. This 
directly implies that a thread cannot be waiting for more than O(log n) other threads inside attemptCluster(v). 
Therefore, the time spent by each thread to wait on other threads in attemptCluster(v) is upper bounded 
by the maximum number of threads that it can be neighbors with (which, assuming $\epsilon$ is set to 1/2, is at most 
O(log n)), times the time it takes each of these threads to be done with their execution, which is at most 
$\Delta_i \log n$ (even assuming the worst case conflict pattern when updating at most $\Delta_i$ entries in the 
clusterID array). Hence, for C4 the processing time of a single vertex is upper bounded by 
$O(\Delta_i \cdot \log^2 n)$. 


Job allocation. Now, observe that when each thread is done processing a vertex, it picks the next vertex 
from $\mathcal{A}$ (if $\mathcal{A}$ is not empty). This process essentially models a classical greedy task allocation to 
cores, which leads to a 2-approximation in terms of the optimum weight allocation; here the optimum allocation 
leads to a max weight among cores that is at most equal to $\max(\Delta_i, B_i \Delta_i / P)$. This implies that the 
running time on P asynchronous threads of a single batch is upper bounded by 

$$O\left(\max\left\{\Delta_i \log n,\ \frac{B_i \Delta_i \log^2 n}{P}\right\}\right) = O\left(\max\left\{\Delta_i \log n,\ \frac{n_i \log^2 n}{P}\right\}\right).$$

Assuming that the number of cores is always less than the batch size (a reasonable assumption, as more 
cores would not lead to further benefits), we obtain that the time for a single batch is 

$$O\left(\frac{E_i}{P} + \frac{n_i \log^2 n}{P}\right).$$

Observe that a difference in ClusterWild! is that waiting is avoided; hence, the running time per 
batch of ClusterWild! is 

$$O\left(\frac{E_i}{P} + \frac{n_i}{P} + P\right).$$

Multiplying the above with the number of rounds given by Lemma 1, we obtain the theorem. 


□ 
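
The greedy task allocation invoked in the job allocation step of the proof can be illustrated with a short sketch: each job goes to the core that becomes idle first, and the resulting finishing time is compared with the lower bound max(largest job, total work / P). The job weights below are made up for the example, and `greedy_makespan` is a hypothetical helper rather than part of the paper's code.

```python
import heapq


def greedy_makespan(weights: list[float], cores: int) -> float:
    """Assign each job, in the given order, to the core that frees up first
    and return the time at which the last core finishes."""
    loads = [0.0] * cores
    heapq.heapify(loads)
    for w in weights:
        earliest = heapq.heappop(loads)   # core that becomes idle first
        heapq.heappush(loads, earliest + w)
    return max(loads)


# Hypothetical per-vertex processing times for one batch.
weights = [5.0, 3.0, 8.0, 2.0, 7.0, 4.0, 6.0, 1.0]
cores = 3

makespan = greedy_makespan(weights, cores)
lower_bound = max(max(weights), sum(weights) / cores)
print(makespan, lower_bound)
```

The greedy makespan never exceeds twice the lower bound, which is the 2-approximation property used in the proof.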


A.3 Approximation Guarantees 

One can view the execution of ClusterWild! on G as having KwikCluster run on a “noisy version” of 
G. A main issue is that KwikCluster never allows two neighbors in the original graph to become cluster 
centers. Hence, since ClusterWild! ignores these edges among active vertices, one can view these edges as 
“adversarially” deleted. The major technical contribution of this work is to quantify how these “ignored” 
edges affect the quality of the output solution. The following simple lemma, presented in our main text, is 
useful in quantifying the cost of the output clustering for any peeling algorithm. 

Lemma 5. The cost of any greedy algorithm that picks a vertex v (irrespective of the sampling order), creates 
$C_v$, peels it away and repeats, is equal to the number of bad triangles adjacent to each cluster center v. 

Proof. Consider the first step of the algorithm, for simplicity, and without loss of generality. Let us define 
as $T_{in}$ the number of vertex pairs inside $C_v$ that are not neighbors (i.e., they are joined by a negative 
edge). Moreover, let $T_{out}$ denote the number of pairs formed by a vertex inside $C_v$ and a neighbor of it 
outside $C_v$. Then, the number of disagreements (i.e., the number of misplaced pairs of vertices) generated by 
cluster $C_v$ is equal to $T_{in} + T_{out}$. 

Observe that all the $T_{in}$ edges are negative, and all the $T_{out}$ edges are positive. Let, for example, 
(u, w) be one of the $T_{in}$ negative edges inside $C_v$; hence both u and w belong to $C_v$ (i.e., are neighbors 
with v). Then, (u, v, w) forms a bad triangle. Similarly, for every positive edge with one end point 
$u' \in C_v$ and the other end point $w' \in V \setminus C_v$, the triangle formed by (v, u', w') is also a bad triangle. 

Hence, all edges that are accounted for in the final cost of the algorithm (i.e., the total number of 
disagreements) are equal to the $T_{in} + T_{out}$ bad triangles that include these edges and each cluster center 
per round. □ 

Let us now consider the set of all cluster centers generated by ClusterWild!; call these vertices $C_{CW}$. 
Then, consider the graph G' that is generated by deleting all edges between vertices in $C_{CW}$. Observe that 
this is a random graph, since the set of edges deleted depends on the specific random sampling that is 
performed in ClusterWild!. We will use the following simple technical proposition to quantify how many more 
bad triangles G' has compared to G. 

Proposition A.4. Given any graph G with positive and negative edges, let us obtain a graph $G_e$ where 
we have removed a single edge e from G. Then, $G_e$ has at most $\Delta$ more bad triangles compared to G. 

Proof. Let (i, j, k) be a bad triangle in $G_e$ but not in G. Then it must be the case that e is an edge of 
this triangle. WLOG let e = (i, j), and so $k \in N(i) \cap N(j)$. Since $|N(i) \cap N(j)| \le \min(\deg_i, \deg_j) \le \Delta$, 
there can be at most $\Delta$ new bad triangles in $G_e$. □ 
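
The effect of deleting one edge can be verified by brute force on a toy instance. The sketch below treats a bad triangle as a vertex triple with exactly two positive edges, counts bad triangles before and after removing each positive edge in turn, and compares the increase with the maximum degree. The graph and the helper names are assumptions chosen only for illustration.

```python
from itertools import combinations


def bad_triangles(n: int, pos: set[frozenset]) -> int:
    """Count vertex triples with exactly two positive edges and one negative edge."""
    count = 0
    for a, b, c in combinations(range(n), 3):
        positives = sum(frozenset(e) in pos for e in ((a, b), (a, c), (b, c)))
        if positives == 2:
            count += 1
    return count


def max_degree(n: int, pos: set[frozenset]) -> int:
    """Maximum number of positive edges incident to a single vertex."""
    return max(sum(v in e for e in pos) for v in range(n))


# Toy instance (an assumption for illustration): 5 vertices, 6 positive edges.
n = 5
pos = {frozenset(e) for e in [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]}

base = bad_triangles(n, pos)
delta_max = max_degree(n, pos)
# Increase in the number of bad triangles after deleting each positive edge in turn.
increase = {tuple(sorted(e)): bad_triangles(n, pos - {e}) - base for e in pos}
print(base, delta_max, increase)
```

Note that some deletions reduce the number of bad triangles, since removing an edge can also destroy bad triangles that contained it.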

The above proposition is used to bound the $T_{new}$ term of Lemma 6. Now, assume a random permutation 
$\pi$ for which we run ClusterWild!, and let $\mathcal{A} = \bigcup_r \mathcal{A}_r$ denote the union of all active sets of 
vertices, for each round r of the algorithm. Moreover, let $\hat{G}$ denote the graph that is missing all edges 
between the vertices in the sets $\mathcal{A}_r$. A simple way to bound the clustering error of ClusterWild! is to 
split it into two terms: the number of old bad triangles of G adjacent to active vertices (i.e., we need to 
bound the expectation of the event that an active vertex is adjacent to an “old” triangle), plus the number of 
all new triangles induced by ignoring edges. Observe that this bound can be loose, since not all “new” bad 
triangles of $\hat{G}$ count towards the clustering error, and some “old” bad triangles can disappear. However, 
this makes the analysis tractable. Lemma 6 then follows. 

Lemma 6. Let $\hat{G}$ denote the random graph induced by deleting all edges between active vertices per round, 
for a given run of ClusterWild!, and let $T_{new}$ denote the number of additional bad triangles that $\hat{G}$ has 
compared to G. Then, the expected cost of ClusterWild! can be upper bounded as 

$$\mathbb{E}\left\{ \sum_{t \in T_b} \mathbf{1}_{\mathcal{P}_t} + T_{new} \right\},$$

where $\mathcal{P}_t$ is the event that triangle t, with end points i, j, k, is bad, and at least one of its end points 
becomes active, while t is still part of the original unclustered graph. 


B Implementation Details 

Our implementation is highly optimized in our effort to have practically scalable algorithms. We discuss 
these details in this section. 

B.1 Atomic and non-atomic variables in Java/Scala 

In Java/Scala, processors maintain their own local cache of variable values, which could lead to spinlocks 
in C4 or greater errors in ClusterWild!. It is necessary to enforce a consistent view across all processors 
by the use of synchronization or AtomicReferences, but doing so will incur high overheads that render the 
algorithm not scalable. 

To mitigate this overhead, we exploit a monotonicity property of our algorithms: the clusterID of any 
vertex is a non-increasing value. Thus, many of the checks in C4 and ClusterWild! may be sufficiently 
performed using only an outdated version of clusterID. Hence, we may maintain both an inconsistent but 
cheap clusterID array as well as an expensive but consistent atomic clusterID array. Most reads can be done 
using the cheap inconsistent array, but writes must propagate to the consistent atomic array. Since each 
clusterID is written a few times but read often, this allows us to minimize the cost of synchronizing values 
without any substantial changes to the algorithm itself. 
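
The two-array idea can be sketched as follows. This is a Python illustration under stated assumptions and not the Java/Scala implementation described above: a `threading.Lock` stands in for the atomic array, and the class and method names are hypothetical. The point it shows is that, because cluster IDs only decrease, a stale read from the cheap array is never smaller than the authoritative value.

```python
import threading


class ClusterIds:
    """Two views of the same non-increasing cluster IDs: a cheap array that
    may lag behind, and an authoritative array guarded by a lock."""

    def __init__(self, n: int, unassigned: int) -> None:
        self._cheap = [unassigned] * n          # read without synchronization
        self._consistent = [unassigned] * n     # read and written under the lock
        self._lock = threading.Lock()

    def read_cheap(self, v: int) -> int:
        """Possibly stale value; never smaller than the authoritative one."""
        return self._cheap[v]

    def read_consistent(self, v: int) -> int:
        with self._lock:
            return self._consistent[v]

    def write_min(self, v: int, cluster_id: int) -> int:
        """Lower the ID of v to cluster_id if that is smaller; return the winner."""
        with self._lock:
            if cluster_id < self._consistent[v]:
                self._consistent[v] = cluster_id
                self._cheap[v] = cluster_id
            return self._consistent[v]


ids = ClusterIds(n=4, unassigned=10**9)
writers = [threading.Thread(target=ids.write_min, args=(2, c)) for c in (7, 3, 5)]
for w in writers:
    w.start()
for w in writers:
    w.join()
print(ids.read_cheap(2), ids.read_consistent(2))
```

A check that only needs an upper bound on the cluster ID can use `read_cheap`, falling back to `read_consistent` when the answer depends on the exact current value.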

We point out that the same concepts may be applied in a distributed setting to minimize communication 
costs. 

B.2 Estimating but not computing $\Delta$ 

As written, the BSP variants require a computation of the maximum degree $\Delta$ at each round. Since this 
effectively involves a scan of all the edges, it can be an expensive operation to perform at each iteration. We 
instead use a proxy $\hat{\Delta}$ which is initialized to $\Delta$ in the first round, and halved every 
$\frac{2}{\epsilon}\ln(n \log \Delta / \delta)$ rounds. 

With a simple modification to Lemma 1, we can see that w.h.p. any vertex with degree greater than $\hat{\Delta}$ 
will either be clustered or have its degree halved after $\frac{2}{\epsilon}\ln(n \log \Delta / \delta)$ rounds, so 
$\hat{\Delta}$ upper-bounds $\Delta$ and our algorithms complete in a logarithmic number of rounds. 
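
A minimal sketch of the proxy schedule is given below. It assumes that $\log \Delta$ is taken base 2 and that the halving period is rounded up to an integer; both choices, as well as the function name and the parameter values, are assumptions made for this example.

```python
import math


def proxy_degree_schedule(n: int, max_deg: int, eps: float, delta: float,
                          rounds: int) -> list[int]:
    """Proxy value used in each of `rounds` rounds: it starts at max_deg and is
    halved every ceil((2 / eps) * ln(n * log2(max_deg) / delta)) rounds."""
    period = math.ceil((2.0 / eps) * math.log(n * math.log2(max_deg) / delta))
    proxy, schedule = max_deg, []
    for r in range(rounds):
        if r > 0 and r % period == 0:
            proxy = max(1, proxy // 2)   # never drop below degree 1
        schedule.append(proxy)
    return schedule


# Illustrative parameters only.
schedule = proxy_degree_schedule(n=1000, max_deg=16, eps=0.5, delta=0.1, rounds=130)
print(schedule[0], schedule[43], schedule[86], schedule[129])
```

The schedule depends only on n, the initial maximum degree, $\epsilon$ and $\delta$, so no scan of the edges is needed after the first round.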

B.3 Lazy deletion of vertices and edges 

In practice, we do not remove vertices and edges as they are clustered, but simply skip over them when 
they are encountered later in the process. We find that this approach decreases the runtimes and overall 
complexity of the algorithm. (In particular, edges between vertices adjacent to cluster centers may never be 
touched in the lazy deletion scheme, but must nevertheless be removed in the proactive deletion approach.) 
Lazy deletions also allow us to avoid expensive mutations of internal data structures. 

B.4 Binomial sampling instead of fixed-size batches 

Lazy deletion does introduce an extra complication, namely it is now more difficult to sample a fixed-size 
batch of $\epsilon n_i / \Delta_i$ vertices, where $n_i$ is the number of remaining unclustered vertices. This is 
because we do not maintain a separate set of $n_i$ unclustered vertices, nor explicitly compute the value of 
$n_i$. 

We do, however, maintain a set of unprocessed vertices, that is, a suffix of $\pi$ containing $n_i$ unclustered 
vertices and $m_i$ clustered vertices that have not been passed through by the algorithm. We may therefore 
resort to an i.i.d. sampling of these vertices, choosing each with probability $\epsilon / \Delta_i$. Since processing 
an unprocessed but clustered vertex has no effect, we effectively simulate an i.i.d. sampling of the unclustered 
vertices. 

Furthermore, we do not have to actually sample each vertex: because $\pi$ is a uniform random permutation, 
it suffices to draw $B \sim \mathrm{Bin}(n_i + m_i, \epsilon / \Delta_i)$ and extract the next B elements from $\pi$ for 
processing, reducing the number of random draws from $n_i + m_i$ Bernoullis to a single Binomial. 

All of our theorems hold in expectation when using i.i.d. sampling instead of fixed-size batches. 
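
The combination of lazy deletion and binomial batch sizes can be sketched as follows. The helper names are hypothetical, and `draw_binomial` sums Bernoulli draws only to keep the example free of external dependencies; a library binomial sampler would replace it with a single draw, which is the saving described above.

```python
import random


def draw_binomial(rng: random.Random, trials: int, prob: float) -> int:
    """Binomial(trials, prob) draw. Summing Bernoullis keeps the sketch
    dependency-free; a library binomial sampler would do this in one call."""
    return sum(rng.random() < prob for _ in range(trials))


def next_batch(pi: list[int], start: int, clustered: set[int], eps: float,
               max_deg: int, rng: random.Random) -> tuple[list[int], int]:
    """Take the next B ~ Bin(#unprocessed, eps / max_deg) entries of pi, lazily
    skipping those already clustered. Returns (active vertices, new start)."""
    unprocessed = len(pi) - start            # n_i + m_i in the text
    b = draw_binomial(rng, unprocessed, eps / max_deg)
    window = pi[start:start + b]
    active = [v for v in window if v not in clustered]   # lazy deletion
    return active, start + b


rng = random.Random(1)
pi = list(range(20))
rng.shuffle(pi)                              # uniform random permutation
clustered = {pi[0], pi[3]}                   # pretend two vertices were clustered earlier
active, start = next_batch(pi, 0, clustered, eps=0.5, max_deg=2, rng=rng)
print(active, start)
```

Only the position `start` in the permutation has to be stored between rounds; no separate set of unclustered vertices is maintained.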

B.5 Comment on CDK Implementation 

A crucial difference between the CDK algorithm and our algorithms lies in the fact that CDK might reject 
vertices from the active set, which are then placed back into the set of unclustered vertices for potential 
selection at later rounds. Conversely, our algorithms ensure that the active set is always completely processed, 
so any vertex that has been selected will no longer be selected in an active set again. We are therefore able 
to exploit a single random permutation $\pi$ and use the tricks with lazy deletions and binomial sampling that 
are not available to CDK, which instead has to perform the complete i.i.d. sampling. We believe that this 
accounts for the largest difference in runtimes between CDK and our algorithms. 

C Full experiment results 

Figure 3: Empirical mean runtimes. For short, ‘CW’ is ClusterWild! and ‘As’ refers to the asynchronous variants. On larger graphs, 
our parallel algorithms on 3-4 threads are faster than serial KwikCluster. On the smaller graphs, the BSP variants have expensive 
synchronization barriers (relative to the small amount of actual work done) and do not necessarily run faster than serial KwikCluster; 
the asynchronous variants do outperform serial KwikCluster with a few threads. We were only able to run CDK on the smaller graphs, 
for which CDK was 2-3 orders of magnitude slower than serial. Note also that the BSP variants have improved runtimes for larger ε. 

Figure 4: Empirical mean speedups. The best speedups (14x on large graphs) are achieved by asynchronous ClusterWild! which 
has the least coordination, followed by asynchronous C4 (13x on large graphs). The BSP variants achieve up to 10x speedups on large 
graphs, with better speedups as ε increases. On small graphs we obtain poorer speedups as the cost of any contention is magnified as 
the actual work done is comparatively small. There are a couple of kinks at 10 and 16 threads, which we postulate is due to NUMA and 
hyperthreading effects: the EC2 r3.8xlarge instances are equipped with 10-core Intel Xeon E5-2670 v2 (Ivy Bridge) processors with 32 
vCPUs and hyperthreading. 

Figure 5: Empirical objective values relative to mean objective value obtained by serial algorithm. 

Figure 6: Empirical percentage of blocked vertices. Generally the number of blocked vertices increases 
with the number of threads and larger ε values. C4 BSP has fewer blocked vertices than asynchronous C4, 
but at the cost of more synchronization barriers. We point out that across all 100 runs of every graph, the 
maximum percentage of blocked vertices is less than C^^5%; for large sparse graphs, the maximum percentage 
is less than 0.025%, i.e., 1 in 4000. 


































































































































